Online Markov decision processes with policy iteration

Authors

  • Yao Ma
  • Hao Zhang
  • Masashi Sugiyama
Abstract

The online Markov decision process (MDP) is a generalization of the classical Markov decision process that incorporates changing reward functions. In this paper, we propose practical online MDP algorithms with policy iteration and theoretically establish a sublinear regret bound. A notable advantage of the proposed algorithm is that it can be easily combined with function approximation, and thus large and possibly continuous state spaces can be efficiently handled. Through experiments, we demonstrate the usefulness of the proposed algorithm.

Similar resources

Exponential Lower Bounds for Policy Iteration

We study policy iteration for infinite-horizon Markov decision processes. It has recently been shown that policy-iteration-style algorithms have exponential lower bounds in a two-player game setting. We extend these lower bounds to Markov decision processes with the total-reward and average-reward optimality criteria.

Convergence Analysis of Kernel-based On-policy Approximate Policy Iteration Algorithms for Markov Decision Processes with Continuous, Multidimensional States and Actions

Using kernel smoothing techniques, we propose three different online, on-policy approximate policy iteration algorithms which can be applied to infinite horizon problems with continuous and vector-valued states and actions. Using Monte Carlo sampling to estimate the value function around the post-decision state, we reduce the problem to a sequence of deterministic, nonlinear programming problem...

A Unified Approach to Algorithms with a Suboptimality Test in Discounted Semi-Markov Decision Processes

This paper deals with computational algorithms for obtaining the optimal stationary policy and the minimum cost of a discounted semi-Markov decision process. Van Nunen [23] has proposed a modified policy iteration algorithm with a suboptimality test of MacQueen type, where the modified policy iteration algorithm is the policy iteration method with the policy evaluation routine by a finite number of...

Markov Chain Anticipation for the Online Traveling Salesman Problem by Simulated Annealing Algorithm

The arc costs are assumed to be online parameters of the network and decisions should be made while the costs of arcs are not known. The policies determine the permitted nodes and arcs to traverse and they are generally defined according to the departure nodes of the current policy nodes. In on-line created tours arc costs are not available for decision makers. The on-line traversed nodes are f...

Convergence of Simulation-Based Policy Iteration

Simulation-based policy iteration (SBPI) is a modification of the policy iteration algorithm for computing optimal policies for Markov decision processes. At each iteration, rather than solving the average evaluation equations, SBPI employs simulation to estimate a solution to these equations. For recurrent average-reward Markov decision processes with finite state and action spaces, we provide...

Journal:
  • CoRR

Volume: abs/1510.04454

Publication date: 2015